Climate Impact on Agricultural Productivity

Project Report for STAT 303-1 Fall 2025

Author

Sophia Perez

Published

December 9, 2025

Abstract

This study integrates four analytical approaches to assess how climate change influences crop yields and broader economic outcomes. First, these analytical questions allow for the study of how climate and agriculture vary over time and on a regional level. Across analyses, temperature patterns and extreme weather events consistently emerged as the primary drivers of agricultural performance. While global temperature trends showed no strong temporal patterns, regional analyses revealed substantial geographic variation, indicating that climate impacts are more meaningfully interpreted at the regional scale. Furthermore, the analysis questions worked to answer how well climate change variables correlated with crop yield and how they affect the economy. A key global finding shows that crop yields peak within an average temperature range of 10–20°C. Activity of Very High extreme weather event levels has also increased in recent years, is most frequent in the United States, and coincides with relatively higher U.S. crop yields under severe conditions. Economic impact also showed a strong positive linear relationship with crop yield. These findings together highlight the need for region-specific climate policies and agricultural investments that enhance crop and soil resilience against climate change events, which can overall have an impact on economic welfare.

3) Data

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Code
# Describing the Data Source
data = pd.read_csv('climate_change_impact_on_agriculture_2024.csv')
data.head()
Year Country Region Crop_Type Average_Temperature_C Total_Precipitation_mm CO2_Emissions_MT Crop_Yield_MT_per_HA Extreme_Weather_Events Irrigation_Access_% Pesticide_Use_KG_per_HA Fertilizer_Use_KG_per_HA Soil_Health_Index Adaptation_Strategies Economic_Impact_Million_USD
0 2001 India West Bengal Corn 1.55 447.06 15.22 1.737 8 14.54 10.08 14.78 83.25 Water Management 808.13
1 2024 China North Corn 3.23 2913.57 29.82 1.737 8 11.05 33.06 23.25 54.02 Crop Rotation 616.22
2 2001 France Ile-de-France Wheat 21.11 1301.74 25.75 1.719 5 84.42 27.41 65.53 67.78 Water Management 796.96
3 2001 Canada Prairies Coffee 27.85 1154.36 13.91 3.890 5 94.06 14.38 87.58 91.39 No Adaptation 790.32
4 1998 India Tamil Nadu Sugarcane 2.19 1627.48 11.81 1.080 9 95.75 44.35 88.08 49.61 Crop Rotation 401.72
Code
#Quick Analysis:
print('number of observations',len(data))
print()
display('list of columns:',list(data.columns))
print()
display('data shape',data.shape)
print()
display(data.describe())
print()
display(data.dtypes)
number of observations 10000
'list of columns:'
['Year',
 'Country',
 'Region',
 'Crop_Type',
 'Average_Temperature_C',
 'Total_Precipitation_mm',
 'CO2_Emissions_MT',
 'Crop_Yield_MT_per_HA',
 'Extreme_Weather_Events',
 'Irrigation_Access_%',
 'Pesticide_Use_KG_per_HA',
 'Fertilizer_Use_KG_per_HA',
 'Soil_Health_Index',
 'Adaptation_Strategies',
 'Economic_Impact_Million_USD']
'data shape'
(10000, 15)
Year Average_Temperature_C Total_Precipitation_mm CO2_Emissions_MT Crop_Yield_MT_per_HA Extreme_Weather_Events Irrigation_Access_% Pesticide_Use_KG_per_HA Fertilizer_Use_KG_per_HA Soil_Health_Index Economic_Impact_Million_USD
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
mean 2007.088700 15.241299 1611.663834 15.246608 2.240017 4.980900 55.248332 24.955735 49.973708 64.901278 674.269658
std 10.084245 11.466955 805.016815 8.589423 0.998342 3.165808 25.988305 14.490962 28.711027 20.195882 414.591431
min 1990.000000 -4.990000 200.150000 0.500000 0.450000 0.000000 10.010000 0.000000 0.010000 30.000000 47.840000
25% 1999.000000 5.430000 925.697500 7.760000 1.449000 2.000000 32.677500 12.527500 25.390000 47.235000 350.545000
50% 2007.000000 15.175000 1611.160000 15.200000 2.170000 5.000000 55.175000 24.930000 49.635000 64.650000 583.920000
75% 2016.000000 25.340000 2306.997500 22.820000 2.930000 8.000000 77.582500 37.470000 74.825000 82.472500 917.505000
max 2024.000000 35.000000 2999.670000 30.000000 5.000000 10.000000 99.990000 49.990000 99.990000 100.000000 2346.470000
Year                             int64
Country                         object
Region                          object
Crop_Type                       object
Average_Temperature_C          float64
Total_Precipitation_mm         float64
CO2_Emissions_MT               float64
Crop_Yield_MT_per_HA           float64
Extreme_Weather_Events           int64
Irrigation_Access_%            float64
Pesticide_Use_KG_per_HA        float64
Fertilizer_Use_KG_per_HA       float64
Soil_Health_Index              float64
Adaptation_Strategies           object
Economic_Impact_Million_USD    float64
dtype: object
Code
# Numeric and Categorical Variables:
numeric_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(include=object).columns.tolist()

display('Numerical Columns:',numeric_cols)
print('Number of Numeric Columns:', len(numeric_cols))
print()
display('Categorical Columns:',categorical_cols)
print('Number of Numeric Columns:', len(categorical_cols))
'Numerical Columns:'
['Year',
 'Average_Temperature_C',
 'Total_Precipitation_mm',
 'CO2_Emissions_MT',
 'Crop_Yield_MT_per_HA',
 'Extreme_Weather_Events',
 'Irrigation_Access_%',
 'Pesticide_Use_KG_per_HA',
 'Fertilizer_Use_KG_per_HA',
 'Soil_Health_Index',
 'Economic_Impact_Million_USD']
Number of Numeric Columns: 11
'Categorical Columns:'
['Country', 'Region', 'Crop_Type', 'Adaptation_Strategies']
Number of Numeric Columns: 4

5) Data Cleaning

Code
# Any missing values in the dataset?
missing = data.isnull().sum()
print(missing)
Year                           0
Country                        0
Region                         0
Crop_Type                      0
Average_Temperature_C          0
Total_Precipitation_mm         0
CO2_Emissions_MT               0
Crop_Yield_MT_per_HA           0
Extreme_Weather_Events         0
Irrigation_Access_%            0
Pesticide_Use_KG_per_HA        0
Fertilizer_Use_KG_per_HA       0
Soil_Health_Index              0
Adaptation_Strategies          0
Economic_Impact_Million_USD    0
dtype: int64
Code
# Since there are no missing values, we will proceed with checking for outliers for the entire dataset
    # Each part will address whether or not to keep the outliers for certain vars

# Define a function to return all outliers in each col:
def outliers_func(x): # x is a var
    Q1 = np.percentile(x,25)
    Q3 = np.percentile(x,75)
    IQR = Q3 - Q1
    lower_fence = Q1- 1.5*IQR
    upper_fence = Q3 + 1.5*IQR
    return (x > upper_fence) | (x < lower_fence)

numeric = data.select_dtypes(include='number')
outliers = numeric.apply(outliers_func)

outliers.sum()

# There are only outliers in the "Economic_Impact_Million_USD" var
Year                             0
Average_Temperature_C            0
Total_Precipitation_mm           0
CO2_Emissions_MT                 0
Crop_Yield_MT_per_HA             0
Extreme_Weather_Events           0
Irrigation_Access_%              0
Pesticide_Use_KG_per_HA          0
Fertilizer_Use_KG_per_HA         0
Soil_Health_Index                0
Economic_Impact_Million_USD    177
dtype: int64

a) Cleaning - Question 1

Code
# Clean all vars excluding country, region, pesticide use, fertilizer use, adaptation strategies, economic_impact
# No missing values, so no imputing needed
# No outliers in these vars, so no adjustment
# 1) Since we are analyzing trends over the years, we really only want to bin "Extreme_Weather_Events" (explained more in the report)
# 2) Check for any inconsistent data in the vars we are using
    # These vars cannot be negative: Extreme_Weather_Events, Irrigation_Access_%, Crop_Yield, Total_Precipitation
    # Check for consistent data_types in vars (all floats, ints, strings, etc.) --- checked in "Data" section
    # Percentages can't be outside 0-100
Code
# Number of categories in "Crop Type"
display(data['Crop_Type'].unique())
print(data['Crop_Type'].nunique())
array(['Corn', 'Wheat', 'Coffee', 'Sugarcane', 'Fruits', 'Rice', 'Barley',
       'Vegetables', 'Soybeans', 'Cotton'], dtype=object)
10
1) Binning Extreme_Weather_Events
Code
import seaborn as sns
import matplotlib.pyplot as plt
Code
# Must first see if data is skewed to determine whether to use .cut or .qcut

plt.hist(data['Extreme_Weather_Events'])
plt.xlabel('Extreme Weather Events')
plt.ylabel('Frequency')
plt.show()

Code
# Skewed left so .qcut will be used

binned_extreme_weather = pd.qcut(data['Extreme_Weather_Events'], q=4, labels=['Low','Medium','High','Very High'],retbins = True)

data_cleaned_1 = data.copy()

data_cleaned_1['Extreme_Weather_Events'] = binned_extreme_weather[0]

2) Binning Soil_Health_Index:

Code
# There are previously defined ranges:
    # 0-59 --> poor/fair
    # 60-79 --> good
    # 80-100 --> excellent

bins = [0, 59, 79, 100]
labels = ['Poor/Fair', 'Good', 'Excellent']

binned_soil_health_idx = pd.cut(data_cleaned_1['Soil_Health_Index'], bins=bins, labels=labels, include_lowest=True,retbins = True)

data_cleaned_1['Soil_Health_Index'] = binned_soil_health_idx[0]

3) Check for Inconsistencies:

Code
# A) Check for negatives where they shouldn't be

numeric = data.select_dtypes(include='number')

neg_count = (numeric < 0).sum()

neg_count

# All vars are accurate (Temperature should have some negative values)
Year                              0
Average_Temperature_C          1185
Total_Precipitation_mm            0
CO2_Emissions_MT                  0
Crop_Yield_MT_per_HA              0
Extreme_Weather_Events            0
Irrigation_Access_%               0
Pesticide_Use_KG_per_HA           0
Fertilizer_Use_KG_per_HA          0
Soil_Health_Index                 0
Economic_Impact_Million_USD       0
dtype: int64
Code
### B) Percentages can't be outside 0-100

range_vals = numeric.min().astype(str) +'  -  '+numeric.max().astype(str)

range_vals

# All data is accurate
# Also note here that soil_health_index ranges from 30 to 100, extreme_weather ranges from 0-10 and year ranges from 1990 to 2024
Year                            1990.0  -  2024.0
Average_Temperature_C              -4.99  -  35.0
Total_Precipitation_mm         200.15  -  2999.67
CO2_Emissions_MT                     0.5  -  30.0
Crop_Yield_MT_per_HA                 0.45  -  5.0
Extreme_Weather_Events               0.0  -  10.0
Irrigation_Access_%               10.01  -  99.99
Pesticide_Use_KG_per_HA             0.0  -  49.99
Fertilizer_Use_KG_per_HA           0.01  -  99.99
Soil_Health_Index                  30.0  -  100.0
Economic_Impact_Million_USD     47.84  -  2346.47
dtype: object

b) Cleaning - Question 2

Code
# For this question, yearly trends will now be compared between regions
    # 1) Main task here will be to ensure country names and regions are consistent
    # 2) Check for equal time coverage across countries - (if Africa has 1990–2025 but South America has 2000–2025)
Code
# 1) Country names and regions are consistent
print('country names:')
print(data['Country'].unique())
    # Note here -- United States is listed as USA (good for referencing later)

print('regions:')
print(data['Region'].unique())
    # regions describe regions of each country, not regions of the world
    # Will need to pair them with the country to standardize
        # since there are not that many countries, looking at regions within a country may be more insightful than a global analysis
country names:
['India' 'China' 'France' 'Canada' 'USA' 'Argentina' 'Australia' 'Nigeria'
 'Russia' 'Brazil']
regions:
['West Bengal' 'North' 'Ile-de-France' 'Prairies' 'Tamil Nadu' 'Midwest'
 'Northeast' 'New South Wales' 'Punjab' 'North West' 'South East'
 'Grand Est' 'Northwestern' 'Siberian' 'Northwest' 'Victoria'
 'Nouvelle-Aquitaine' 'South' 'Quebec' 'Southeast' 'Ontario' 'East'
 'Pampas' 'Western Australia' 'Volga' 'Maharashtra'
 'Provence-Alpes-Cote d’Azur' 'West' 'Central' 'North Central' 'Patagonia'
 'Queensland' 'South West' 'British Columbia']
Code
# Pairing each region with its country
display(pd.DataFrame(data.groupby('Country')['Region'].unique()))
    # So there are multiple duplicates (ex. North Brazil, North China)

data_cleaned_2 = data_cleaned_1.copy()

# For each region in each observation, we want to pair it with its country
# We can make a function
def region_pair_func(x,y): # x is the country, y is the region
    return y + '(' + x + ')'

data_cleaned_2['Region'] = region_pair_func(data_cleaned_2['Country'],data_cleaned_2['Region'])

data_cleaned_2['Region'].head()
Region
Country
Argentina [Northeast, Northwest, Pampas, Patagonia]
Australia [New South Wales, Victoria, Western Australia,...
Brazil [North, Northeast, Southeast, South]
Canada [Prairies, Quebec, Ontario, British Columbia]
China [North, East, South, Central]
France [Ile-de-France, Grand Est, Nouvelle-Aquitaine,...
India [West Bengal, Tamil Nadu, Punjab, Maharashtra]
Nigeria [North West, South East, North Central, South ...
Russia [Northwestern, Siberian, Volga, Central]
USA [Midwest, Northeast, South, West]
0       West Bengal(India)
1             North(China)
2    Ile-de-France(France)
3         Prairies(Canada)
4        Tamil Nadu(India)
Name: Region, dtype: object
Code
# 2) Check for equal time coverage across countries - (if Africa has 1990–2025 but South America has 2000–2025)

data_cleaned_2.groupby('Region')['Year'].agg(lambda x: x.max() - x.min())
    # it looks like all regions have the same range of years (the max), so there are no yearly differences
Region
British Columbia(Canada)              34
Central(China)                        34
Central(Russia)                       34
East(China)                           34
Grand Est(France)                     34
Ile-de-France(France)                 34
Maharashtra(India)                    34
Midwest(USA)                          34
New South Wales(Australia)            34
North Central(Nigeria)                34
North West(Nigeria)                   34
North(Brazil)                         34
North(China)                          34
Northeast(Argentina)                  34
Northeast(Brazil)                     34
Northeast(USA)                        34
Northwest(Argentina)                  34
Northwestern(Russia)                  34
Nouvelle-Aquitaine(France)            34
Ontario(Canada)                       34
Pampas(Argentina)                     34
Patagonia(Argentina)                  34
Prairies(Canada)                      34
Provence-Alpes-Cote d’Azur(France)    34
Punjab(India)                         34
Quebec(Canada)                        34
Queensland(Australia)                 34
Siberian(Russia)                      34
South East(Nigeria)                   34
South West(Nigeria)                   34
South(Brazil)                         34
South(China)                          34
South(USA)                            34
Southeast(Brazil)                     34
Tamil Nadu(India)                     34
Victoria(Australia)                   34
Volga(Russia)                         34
West Bengal(India)                    34
West(USA)                             34
Western Australia(Australia)          34
Name: Year, dtype: int64

c) Cleaning - Question 3

Code
# How do climate change factors correlate with crop yield?
    # We can do this within each country since there may be global differences

# Same vars for the previous two questions
# Unrealistic values were checked in question 1

# However, it may be interesting to see how the relationship between climate variables and crop yield differs by crop type.
    # For this, we need to OHE crop type

data_cleaned_3 = data_cleaned_2.copy()

# Check to see the categories in crop type
print(data_cleaned_3['Crop_Type'].nunique())
print(data_cleaned_3['Crop_Type'].unique())

crop_type_OHE = pd.get_dummies(data_cleaned_3['Crop_Type'], drop_first= True)

# It can now be used for data analysis/visualization if i so choose
10
['Corn' 'Wheat' 'Coffee' 'Sugarcane' 'Fruits' 'Rice' 'Barley' 'Vegetables'
 'Soybeans' 'Cotton']

d) Cleaning - Question 4

Code
# How do climate change and agricultural practices collectively influence economic impact from crop production globally?
# 1) As we found earlier, there are many outliers in the economic_impact variable, which is introduced in this question

# Filter the outliers
def outliers_func(x): # x is a var
    Q1 = np.percentile(x,25)
    Q3 = np.percentile(x,75)
    IQR = Q3 - Q1
    lower_fence = Q1- 1.5*IQR
    upper_fence = Q3 + 1.5*IQR
    return (x > upper_fence) | (x < lower_fence)

print(outliers_func(data_cleaned_4['Economic_Impact_Million_USD']).sum())

data_cleaned_4 = data_cleaned_3.copy()

data_cleaned_4.loc[outliers_func(data_cleaned_4['Economic_Impact_Million_USD'])]
# When dealing with economic impact, a lot of these outliers may actually be plausible
# However, they may be unrealistic if the crop yield is low and the impact is very high
# Which of these outliers are above Tukey's fences?

def outliers_func(x): # x is a var
    Q1 = np.percentile(x,25)
    Q3 = np.percentile(x,75)
    IQR = Q3 - Q1
    lower_fence = Q1- 1.5*IQR
    upper_fence = Q3 + 1.5*IQR
    return (x > upper_fence)

print(outliers_func(data_cleaned_4['Economic_Impact_Million_USD']).sum())
    # All of them are high values (none are below 
177
177
Code
outliers = data_cleaned_4.loc[outliers_func(data_cleaned_4['Economic_Impact_Million_USD'])]

# Out of these outliers, which have low crop yields?
print('The lowest Crop Yield of all the outliers is:')
print(outliers['Crop_Yield_MT_per_HA'].min())

print('Average of Crop_Yield:')
print(data_cleaned_4['Crop_Yield_MT_per_HA'].mean())

# So the outliers make sense because all of the data in the outliers df has a crop yield above the average crop_yield
# We keep the outliers because it is plausible that a high economic impact is due to a high crop yield
The lowest Crop Yield of all the outliers is:
3.56
Average of Crop_Yield:
2.2400169
Code
# 2) To measure economic impact, it may also be useful to look at adaptation strategies
data_cleaned_4['Adaptation_Strategies'].unique()

# There is no cleaning to be done here because there are 5 nicely-grouped categories
array(['Water Management', 'Crop Rotation', 'No Adaptation',
       'Organic Farming', 'Drought-resistant Crops'], dtype=object)

6) Data Analysis

a) Analysis 1

Code
# There are 3 main categories for analysis:
    # 1) Crop yield over time and also per crop type
    # 2) Climate variables over time -- temperature, precipitation, CO2 emissions, and extreme weather
    # 3) Agricultural factors over time -- soil health index
            # irrigation access, pesticide/fertilizer use can be looked at separately to validate SHI
Code
# 1a) Crop yield over time

plt.figure(figsize=(12, 6))
sns.lineplot(data = data_cleaned_1, x = 'Year', y = 'Crop_Yield_MT_per_HA')
plt.show()
    # For each row of the same month value, the y-value is the average of the 'crop_yield'
    # The shaded region is the 95% CI of the each point

# 1b) Crop yield per crop type
plt.figure(figsize=(12,6))
crop_grouped = data_cleaned_1.groupby(['Year','Crop_Type'])['Crop_Yield_MT_per_HA'].agg('mean').reset_index()

sns.lineplot(data = crop_grouped, x = 'Year', y = 'Crop_Yield_MT_per_HA', hue = 'Crop_Type')
plt.show()

# The "per crop type" plot mirrors the overall plot and is a lot less clean
    # so not necessary for the report

Code
# 2) Climate variables over time -- temperature, precipitation, CO2 emissions, and extreme weather
    # For first three vars, we will do subplots

climate_vars = data_cleaned_1.loc[:,['Year','Average_Temperature_C','Total_Precipitation_mm','CO2_Emissions_MT']]

climate_melted = climate_vars.melt(id_vars = 'Year', var_name = 'Climate_Var', value_name = 'Values')

a = sns.FacetGrid(climate_melted, col = 'Climate_Var',col_wrap=1, height=5, aspect=2, sharey=False)
    # sharey ensures each has its own y axis
a.map_dataframe(sns.lineplot, x='Year', y='Values')
a.set_titles("{col_name}")
plt.show();

    # For extreme weather events, we can do a bar plot


# reminder -- data_cleaned_1['Extreme_Weather_Events'] = binned_extreme_weather[0]

extreme_counts = (data_cleaned_1.groupby(['Year', 'Extreme_Weather_Events']).size().reset_index(name='Count'))

plt.figure(figsize=(14,6))
sns.lineplot(data=extreme_counts, x='Year', y='Count', hue='Extreme_Weather_Events')
plt.title("Trend of Extreme Weather Events by Category")
plt.ylabel("Number of Observations")
plt.show()

# no binned data
plt.figure(figsize=(14,6))
sns.lineplot(data=data, x='Year', y='Extreme_Weather_Events')
plt.show()

Code
# Agricultural factors over time -- Soil Health Index

# not binned
plt.figure(figsize=(12, 6))
sns.lineplot(data = data, x = 'Year', y = 'Soil_Health_Index')
plt.show()

#binned
SHI_counts = (data_cleaned_1.groupby(['Year', 'Soil_Health_Index']).size().reset_index(name='Count'))

plt.figure(figsize=(14,6))
sns.lineplot(data=SHI_counts, x='Year', y='Count', hue='Soil_Health_Index')
plt.title("Trend of Soil Health Index by Category")
plt.ylabel("Number of Observations")
plt.show()

sns.histplot(data['Soil_Health_Index'])

Code
# Check to see if any impact from other vars

agri_vars = data_cleaned_1.loc[:, ['Year', 'Irrigation_Access_%', 'Pesticide_Use_KG_per_HA', 'Fertilizer_Use_KG_per_HA']]
agri_melted = agri_vars.melt(id_vars='Year', var_name='Agri_Var', value_name='Values')
g = sns.FacetGrid(agri_melted, col='Agri_Var', col_wrap=3, height=3, aspect=1, sharey=False)
g.map_dataframe(sns.lineplot, x='Year', y='Values')
g.set_titles("{col_name}")
g.set_axis_labels("Year", "")
plt.tight_layout()
plt.show()

b) Analysis 2

Code
# From the previous question, we saw the greatest trends in "Crop Yield", 
# "Extreme Weather Events", and "Soil Health Index". 
    # However, we will plot temperature changes on the map to see regional differences as a baseline
Code
# 1) Plotting temperature

region_pivot = data_cleaned_1.pivot_table(
    index = 'Region', columns = 'Country', values = 'Average_Temperature_C', aggfunc = 'mean'
).reset_index()

region_melt = region_pivot.melt(id_vars='Region',var_name = 'Country',value_name = 'Average Temp (C)')
region_melt = region_melt.dropna(subset=['Average Temp (C)'])

a = sns.FacetGrid(region_melt, col = 'Country',col_wrap=3, height=4, aspect = 1,sharex=False)

a.map(sns.barplot,'Region','Average Temp (C)')
a.set_xticklabels(rotation=45)
a.tight_layout()
plt.show();

# title as regional differences in temperature for each country

Code
# Now lets look at regional changes:
    # use og data to look at numeric changes

# want raw values from data but region names from data_cleaned_2:
raw_values = data[['Crop_Yield_MT_per_HA', 'Soil_Health_Index', 'Year']].copy()
raw_values['Region']  = data_cleaned_2['Region']
raw_values['Country'] = data_cleaned_2['Country']
raw_values = raw_values[['Country', 'Region', 'Year', 'Crop_Yield_MT_per_HA', 'Soil_Health_Index']]

change_df = raw_values.groupby(['Country', 'Region']).apply(
    lambda x: pd.Series({
        'Crop_Yield_Change': abs(x.loc[x['Year'].idxmax(), 'Crop_Yield_MT_per_HA']
                       - x.loc[x['Year'].idxmin(), 'Crop_Yield_MT_per_HA']),

        'Soil_Health_Index_Change':   abs(x.loc[x['Year'].idxmax(), 'Soil_Health_Index']
                       - x.loc[x['Year'].idxmin(), 'Soil_Health_Index'])
    })
).reset_index()

#display((change_df.sort_values(['Crop_Yield_Change'], ascending=False)).head(10))
    #gives us the top 10 regions with greatest change in crop yield

#display((change_df.sort_values(['Soil_Health_Index_Change'], ascending=False)).head(10))
    #gives us the top 10 regions with greatest change in soil health index

# Concatenate! (for the report)
top_crop = (
    change_df.sort_values('Crop_Yield_Change', ascending=False)
             .head(10)[['Country', 'Region', 'Crop_Yield_Change']]
             .reset_index(drop=True)
)

top_shi = (
    change_df.sort_values('Soil_Health_Index_Change', ascending=False)
             .head(10)[['Country', 'Region', 'Soil_Health_Index_Change']]
             .reset_index(drop=True)
)

# Place them side-by-side
combined = pd.concat([top_crop, top_shi], axis=1)

display(combined)


# we can't do this on extreme events because all regions have discrete values from 0-10
Country Region Crop_Yield_Change Country Region Soil_Health_Index_Change
0 Brazil Northeast(Brazil) 2.691 Brazil South(Brazil) 54.30
1 France Ile-de-France(France) 2.549 Russia Siberian(Russia) 49.48
2 Argentina Northeast(Argentina) 2.521 France Nouvelle-Aquitaine(France) 49.32
3 Russia Northwestern(Russia) 2.520 Argentina Northwest(Argentina) 45.44
4 Brazil North(Brazil) 2.069 Russia Northwestern(Russia) 43.93
5 Brazil Southeast(Brazil) 2.023 India Punjab(India) 42.70
6 USA South(USA) 1.878 Argentina Patagonia(Argentina) 42.04
7 India West Bengal(India) 1.701 India Maharashtra(India) 41.52
8 Argentina Northwest(Argentina) 1.700 Argentina Northeast(Argentina) 35.92
9 Russia Volga(Russia) 1.673 Argentina Pampas(Argentina) 32.90
Code
# First make a palette for each country
countries = change_df['Country'].unique()
palette = sns.color_palette("tab10", n_colors=len(countries))
palette_dict = dict(zip(countries, palette))

# now plot
#top_crop = change_df.sort_values('crop_change', ascending=True).head(10)

plt.figure(figsize=(8,6))
sns.barplot(
    x='Crop_Yield_Change',
    y='Region',
    data=top_crop,
    hue='Country',
    palette=palette_dict,
    dodge=False
)
plt.xlabel('Absolute Change in Crop Yield (MT/HA)')
plt.ylabel('Region')
plt.title('Top 10 Regions with Greatest Crop Yield Change')
plt.legend(title='Country')
plt.tight_layout()
plt.show()

Code

plt.figure(figsize=(8,6))
sns.barplot(
    x='Soil_Health_Index_Change',
    y='Region',
    data=top_shi,
    hue='Country',
    palette=palette_dict,
    dodge=False
)
plt.xlabel('Absolute Change in Soil Health Index')
plt.ylabel('Region')
plt.title('Top 10 Regions with Greatest Soil Health Index Change')
plt.legend(title='Country')
plt.tight_layout()
plt.show()

Code
# Now let's look at extreme weather events
    # because they are binned, we want to see which regions have the highest count of
        # very high levels
    # however, it might also be interesting to see which countries fluctuate the most 
        # between low and very high levels

# We can use a heatmap for this!

pivot = data_cleaned_2.pivot_table(
    index='Country',
    columns='Extreme_Weather_Events',
    aggfunc='size',
    fill_value=0
)

plt.figure(figsize=(10,8))
sns.heatmap(pivot, annot=True, cmap='Reds')
plt.title('Extreme Weather Events by Country')
plt.xlabel('Event Level')
plt.ylabel('Country')
plt.show()

# countries were used instead of regions here simply because there are too many regions
    # for the visualization to be clean

c) Analysis 3:

Code
# Here we want to see how climate variables correlate with crop yield:
    # Climate vars --> temp, precipitation, CO2 emissions, and extreme weather events

# from before we found that:
    # the first three vars do not vary significantly year-to-year
    # USA has the highest extreme weather levels

# For the first three variables, we can do a pairplot to see correlation maybe?
# For the extreme_weather_events, we can focus in on the USA or even Australia
Code
# 1) Climate variable correlation

cols = [
    'Average_Temperature_C',
    'Total_Precipitation_mm',
    'CO2_Emissions_MT',
    'Crop_Yield_MT_per_HA'
]

sns.pairplot(data_cleaned_3[cols])
plt.show()

Code
# Use a heatmap instead


cols = [
    'Average_Temperature_C',
    'Total_Precipitation_mm',
    'CO2_Emissions_MT',
    'Crop_Yield_MT_per_HA','Soil_Health_Index'
]

corr = data[cols].corr()

sns.heatmap(corr, annot=True, cmap = 'RdBu', vmin=-1, vmax=1)
plt.title("Correlation Between Crop Yield and Climate Variables")
plt.show();

Code
# Exploring temperature against crop yield further:

sns.scatterplot(
    data=data_cleaned_3, 
    x='Average_Temperature_C', 
    y='Crop_Yield_MT_per_HA'
)

plt.xlabel('Average Temperature (°C)')
plt.ylabel('Crop Yield (MT/HA)')
plt.title('Temperature vs Crop Yield')
plt.show();

Code
# What if we look at it for different crop types?
g = sns.FacetGrid(data_cleaned_1, col='Crop_Type', col_wrap=3, height=4)
g.map_dataframe(sns.scatterplot, x='Average_Temperature_C', y='Crop_Yield_MT_per_HA')
g.map_dataframe(sns.regplot, x='Average_Temperature_C', y='Crop_Yield_MT_per_HA', scatter=False)
g.set_titles("{col_name}")
plt.show()

# no trends present

Code
# Correlation between extreme weather events and crop yield

usa = data_cleaned_3[data_cleaned_3['Country'] == 'USA']

sns.boxplot(
    data=usa,
    x='Extreme_Weather_Events',
    y='Crop_Yield_MT_per_HA'
)
plt.title("Crop Yield Across Extreme Weather Categories in the USA")
plt.show();

#ALL countries/regions

sns.boxplot(
    data=data_cleaned_3,
    x='Extreme_Weather_Events',
    y='Crop_Yield_MT_per_HA'
)
plt.title("Crop Yield Across Extreme Weather Categories for All Countries")
plt.show();

Code
# Numerical data tables:

# USA summary using describe
usa_summary = usa.groupby('Extreme_Weather_Events')['Crop_Yield_MT_per_HA'].describe()
usa_summary['Range'] = usa_summary['max'] - usa_summary['min']
display("USA Crop Yield Summary by Extreme Weather Events")
display(usa_summary)

# All countries summary using describe
global_summary = data_cleaned_3.groupby('Extreme_Weather_Events')['Crop_Yield_MT_per_HA'].describe()
global_summary['Range'] = global_summary['max'] - global_summary['min']
display("\nGlobal Crop Yield Summary by Extreme Weather Events")
display(global_summary)
'USA Crop Yield Summary by Extreme Weather Events'
count mean std min 25% 50% 75% max Range
Extreme_Weather_Events
Low 270.0 2.196337 0.932528 0.513 1.530 2.1105 2.8755 4.58 4.067
Medium 273.0 2.214839 0.932913 0.459 1.512 2.1330 2.9000 4.68 4.221
High 274.0 2.201522 1.010211 0.468 1.395 2.1120 2.8530 4.89 4.422
Very High 215.0 2.366972 0.964928 0.570 1.658 2.3760 3.0045 4.90 4.330
'\nGlobal Crop Yield Summary by Extreme Weather Events'
count mean std min 25% 50% 75% max Range
Extreme_Weather_Events
Low 2755.0 2.236892 0.982763 0.468 1.45800 2.1600 2.951 4.94 4.472
Medium 2724.0 2.259086 1.001251 0.450 1.48875 2.1885 2.934 5.00 4.550
High 2696.0 2.224329 1.003056 0.450 1.43100 2.1500 2.892 4.99 4.540
Very High 1825.0 2.239447 1.010630 0.450 1.41300 2.2140 2.934 5.00 4.550

d) Analysis 4

Code
# A couple of subquestions:
    # How does economic impact relate to crop yield?


def summary_df(data, group_col, value_col):
    summary = data.groupby(group_col)[value_col].agg(['mean', 'std', 'count']).reset_index()
    summary['sem'] = summary['std'] / np.sqrt(summary['count'])  # standard error
    return summary


# How does economic impact differ across countries?
sns.barplot(
    x='Country',
    y='Economic_Impact_Million_USD',
    hue = 'Country',
    palette= 'hls',
    data=data_cleaned_4,
    dodge=False
)
plt.xticks(rotation = 45)
plt.show();

country_summary = summary_df(data_cleaned_4, 'Country', 'Economic_Impact_Million_USD')
print("Economic Impact by Country")
print(country_summary)


# How does economic impact differ across adaptation strategies?
sns.barplot(
    x='Adaptation_Strategies',
    y='Crop_Yield_MT_per_HA',
    hue = 'Adaptation_Strategies',
    palette= 'crest',
    data=data_cleaned_4,
    dodge=False
)
plt.xticks(rotation = 45)
plt.show();

strategy_summary = summary_df(data_cleaned_4, 'Adaptation_Strategies', 'Crop_Yield_MT_per_HA')
print("\nCrop Yield by Adaptation Strategy")
print(strategy_summary)

Economic Impact by Country
     Country        mean         std  count        sem
0  Argentina  674.055935  416.156121    984  13.266573
1  Australia  660.917752  418.982861   1032  13.042367
2     Brazil  675.135339  412.385134    944  13.421993
3     Canada  680.121931  410.546507    984  13.087745
4      China  674.689835  416.854107   1031  12.982393
5     France  663.225501  404.198313    978  12.924837
6      India  683.412673  418.990602   1025  13.087068
7    Nigeria  699.281147  429.024206   1029  13.374394
8     Russia  666.132914  417.119406    961  13.455465
9        USA  665.057074  400.749492   1032  12.474787


Crop Yield by Adaptation Strategy
     Adaptation_Strategies      mean       std  count       sem
0            Crop Rotation  2.271590  1.012447   1957  0.022886
1  Drought-resistant Crops  2.244425  0.991208   1995  0.022192
2            No Adaptation  2.237939  1.001804   2024  0.022268
3          Organic Farming  2.238188  0.982698   1975  0.022112
4         Water Management  2.209385  1.003231   2049  0.022163
Code
# How does economic impact relate to crop yield?
sns.scatterplot(
    x = 'Crop_Yield_MT_per_HA',
    y = 'Economic_Impact_Million_USD',
    data = data_cleaned_4,
)
plt.show();


#Less dense
sns.regplot(
    data = data_cleaned_4.sample(1000, random_state=42),
    x = 'Crop_Yield_MT_per_HA',
    y = 'Economic_Impact_Million_USD'
)

plt.show();